
$$\frac{\partial L^{Adv}_p}{\partial C^l_p} = \sum_i 2\left(1 - D_p(T^l_{p,i}; Y_p)\right)\frac{\partial D_p}{\partial C^l_p}. \qquad (3.95)$$

Furthermore,

$$\frac{\partial L^{Data}_p}{\partial C^l_p} = \frac{1}{n}\sum_i \left(R_p - T_p\right)\frac{\partial T_p}{\partial C^l_p}. \qquad (3.96)$$
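To make these gradients concrete, the sketch below uses autograd to reproduce terms of the same form. It is a minimal sketch, not the original implementation: it assumes a squared (LSGAN-style) adversarial term and an MSE data term, so the resulting derivatives match Eqs. (3.95) and (3.96) up to the sign convention; the toy discriminator, the tensor shapes, and the per-filter layout of C^l_p are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch (assumptions noted above): T_p are feature maps from the pruned 1-bit
# network, R_p are feature maps from the pre-trained model, and C is the learnable matrix.

def adversarial_term(d_out):
    # Squared term sum_i (1 - D_p(T^l_{p,i}; Y_p))^2; its derivative w.r.t. C^l_p has the
    # 2 * (1 - D_p) * dD_p/dC^l_p form of Eq. (3.95), up to sign.
    return ((1.0 - d_out) ** 2).sum()

def data_term(r_p, t_p):
    # MSE between R_p and T_p; its derivative w.r.t. C^l_p has the
    # (1/n) * sum_i (R_p - T_p) * dT_p/dC^l_p form of Eq. (3.96), again up to sign.
    n = t_p.shape[0]
    return ((r_p - t_p) ** 2).sum() / (2.0 * n)

torch.manual_seed(0)
w_bin = torch.randn(8, 3, 3, 3).sign()           # fixed 1-bit weights (stand-in for W-hat_p)
C = torch.ones(8, 1, 1, 1, requires_grad=True)   # learnable matrix C^l_p (per-filter here)
x = torch.randn(4, 3, 16, 16)                    # mini-batch of inputs
r_p = torch.randn(4, 8, 14, 14)                  # stand-in feature maps of the pre-trained model

t_p = nn.functional.conv2d(x, w_bin * C)         # T_p = Conv(F_in, W-hat_p * C)
discriminator = nn.Sequential(nn.Flatten(), nn.Linear(8 * 14 * 14, 1), nn.Sigmoid())
loss = adversarial_term(discriminator(t_p)) + data_term(r_p, t_p)
loss.backward()                                  # C.grad now holds the two gradient contributions
print(C.grad.shape)                              # torch.Size([8, 1, 1, 1])
```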

The complete training process is summarized in Algorithm 4, including the update of the discriminators; a code sketch of the pruned 1-bit convolution in line 6 follows the listing.

Algorithm 4 Pruned RBCN
Input: The training dataset, the pre-trained 1-bit CNN model, the feature maps R_p from the pre-trained model, the pruning rate, and the hyperparameters, including the initial learning rate, weight decay, convolution stride, and padding size.
Output: The pruned RBCN with updated parameters W_p, Ŵ_p, M_p, and C_p.
1: repeat
2:     Randomly sample a mini-batch;
3:     // Forward propagation
4:     Train a pruned architecture // Using Eqs. 17-22
5:     for all l = 1 to L convolutional layer do
6:         F^l_{out,p} = Conv(F^l_{in,p}, (Ŵ^l_p ⊙ M_p) · C^l_p);
7:     end for
8:     // Backward propagation
9:     for all l = L to 1 do
10:        Update the discriminators D^l_p(·) by ascending their stochastic gradients:
11:        ∇_{D^l_p}( log(D^l_p(R^l_p; Y_p)) + log(1 − D^l_p(T^l_p; Y_p)) + log(D^l_p(T_p; Y_p)) );
12:        Update the soft mask M_p by FISTA // Using Eqs. 24-26
13:        Calculate the gradients δW^l_p; // Using Eqs. 27-31
14:        W^l_p ← W^l_p − η_{p,1} δW^l_p; // Update the weights
15:        Calculate the gradient δC^l_p; // Using Eqs. 32-36
16:        C^l_p ← C^l_p − η_{p,2} δC^l_p; // Update the learnable matrix
17:    end for
18: until the maximum epoch
19: Ŵ = sign(W).
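As an illustration of the forward step in line 6 of Algorithm 4, the sketch below builds the pruned 1-bit convolution F_out = Conv(F_in, (Ŵ_p ⊙ M_p) · C_p) in PyTorch. The per-filter granularity of M_p and C_p and the straight-through binarization are assumptions made for the sketch, not details fixed by the text; the FISTA mask update and the discriminator ascent of lines 11-12 are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrunedRBConv(nn.Module):
    """Sketch of F_out = Conv(F_in, (W-hat_p ⊙ M_p) · C_p) with per-filter M_p and C_p."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.1)  # W_p
        self.mask = nn.Parameter(torch.ones(out_ch, 1, 1, 1))               # soft mask M_p
        self.C = nn.Parameter(torch.ones(out_ch, 1, 1, 1))                  # learnable matrix C_p

    def forward(self, x):
        # Binarize with a straight-through estimator so gradients still reach W_p.
        w_bin = self.weight + (self.weight.sign() - self.weight).detach()   # W-hat_p
        return F.conv2d(x, (w_bin * self.mask) * self.C, padding=1)

layer = PrunedRBConv(16, 32)
out = layer(torch.randn(2, 16, 8, 8))    # F_out has shape (2, 32, 8, 8)
```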

3.6.4 Ablation Study

This section studies the performance contributions of the kernel approximation, the GAN,

and the update strategy (we fix the parameters of the convolutional layers and update the

other layers). CIFAR100 and ResNet18 with different kernel stages are used.

1) We replace the convolution in Bi-Real Net with our kernel approximation (RBConv) and compare the results. As shown in the columns "Bi" and "R" of Table 3.3, RBCN achieves an accuracy improvement of 1.62% over Bi-Real Net (56.54% vs. 54.92%) using the same network structure as ResNet18. This significant improvement verifies the effectiveness of the learnable matrices.

2) Using the GAN improves RBCN by 2.59% (59.13% vs. 56.54%) with the kernel stage 32-32-64-128, which shows that the GAN helps mitigate the problem of being trapped in poor local minima.